The goal of this project is to build a multiple regression model using data from the 2022-2023 NHL season to predict how many points skaters should have given their individual statistics. We split the skaters into three categories: forwards (\(f\)), defensemen (\(d\)), and forwards on the powerplay (\(pp\)). Our models have the following adjusted \(R^2\) values: \(R^2_f = 0.9585\), \(R^2_d = 0.925\), and \(R^2_{pp} = 0.9179\). To check the effectiveness of our models, we find the correlation between predicted 2024 point totals and actual 2024 point totals for each category using current data from the 2023-2024 season. We find that \(r_f = 0.9750649\), \(r_d = 0.9518646\), and \(r_{pp} = 0.9415112\) which is strong evidence to suggest that our models are accurately predicting point totals for NHL skaters using current individual statistics.
The question initially asked is what factors cause a given National Hockey League skater to produce points. In the National Hockey League (NHL), there are two groups of players: skaters and goalies. For our purposes, we only examine skaters as they are the players most likely to produce points owing to the fact only 17 goals have been scored by a goalie in NHL history and the points produced by goalies are usually minimal. A point is either the goal scored, the primary assist (meaning that the player was the last player on the scoring team to posses the puck before the goal-scorer) and the secondary assist (the player on the goal-scoring team to possess the puck before the primary assistant). Each goal has an opportunity of having up to three points awarded to that team, as is most often the case, but can also have two or one points awarded depending on the situation. A player’s ice time is limited by shifts, the number of times the player is on the ice for an extended period of time. Hockey is a fast paced game with changes on the fly, thus the amoutn of shifts is difrectly correlated to ice time. Obviously, producing a point greatly benefits the chance of victory and has a large impact on the winning percentage of a team. Because of the nature of hockey, specifically the faster action and long puck possessions on potential scoring opportunities, we decided to use points as the dependent variable of choice as opposed to goals as points are much more common. We find our data from Moneypuck.com, and looked specifically at their data from the 2022-2023 NHL season to run our regression models and the 2023-2024 data for our correlation to compare our model across seasons.
We initially decide to use the individual points statistic for our dependent variable. After this, we divide the raw data to look at every time a given player is on the ice, regardless of the situation. After this, we split into three distinct datasets: one for forwards, one for defensemen, and one for forwards on the powerplay. This occurs when a penalty on the other team has been called and the team has a man advantage, causing an increase in the likelihood of points being scored.Forwards typically have a much higher point output than defensemen. We then create box plots of each dataset to gauge the average point output by each group as well as look for outliers. For the forwards on the powerplay, we look at shifts instead of points as the amount of times a give player is on the powerplay is the limiting factor.
Unsurprisingly, the best players in the league are the outliers in both position groups. Forwards such as Connor McDavid, Nathan MacKinnon, and David Pastranak all are well above highest whisker because of their superstar nature. They are freak athletes who perform well above the “average” NHL skater. Similarly, defensemen such as Erik Karlsson, Cale Makar, and Roman Josi all outperform the “average” defensemen, being well above the top whisker. There were also approximately 100 in each positional category filtered out for too few games played.
We then filter the data to impose an arbitrary minimum of games played (60) so the statistics would be more wholistically representative of a season and players who averaged almost 3 points per game for two games would not be included.
We decided to filter the shifts to be less than 125 to gain a representative sample of the data, similar to why we excluded less than 60 games. The shifts were the determining factor and not the games played. There were no large outliers for the shifts dataset, and there are approximately 200 forwards on the power play filtered out for too few shifts to be of significance. .
The dataset keeps track of 154 individual statistics, most of which have no effect on the number of points a player scores (i.e. name, hits, shots blocked, etc). So our first step for each category is to hand pick the statistics that we felt would actually contribute to a player’s point total.
For forwards, we decide to look at variables which contributed to a heavy number of shots on goal, offensive zone possession, and ice time. The variables we select for forwards are shown in the code below.
f_reg_all <- lm(I_F_points ~ I_F_shotsOnGoal +
I_F_missedShots +
I_F_blockedShotAttempts +
I_F_rebounds +
I_F_freeze +
I_F_playStopped +
I_F_playContinuedInZone +
I_F_savedShotsOnGoal +
I_F_savedUnblockedShotAttempts +
I_F_penalityMinutes +
I_F_faceOffsWon +
I_F_giveaways +
I_F_takeaways +
I_F_oZoneShiftStarts +
I_F_neutralZoneShiftStarts +
penaltiesDrawn +
timeOnBench,
data = forwards
)After hand selecting statistics that we think would contribute to a forward’s point total, we run a multiple regression and use backward elimination on non-significant variables and rerun the model until all the variables left were considered significant at the \(\alpha = 0.05\) level.
##
## Call:
## lm(formula = I_F_points ~ I_F_shotsOnGoal + I_F_blockedShotAttempts +
## I_F_freeze + I_F_savedShotsOnGoal + I_F_penalityMinutes +
## I_F_faceOffsWon + I_F_giveaways + I_F_takeaways + I_F_oZoneShiftStarts +
## I_F_neutralZoneShiftStarts + penaltiesDrawn + timeOnBench,
## data = forwards)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.617 -2.259 -0.104 2.038 29.641
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.961e-01 4.419e-01 -0.444 0.657411
## I_F_shotsOnGoal 1.032e+00 5.453e-02 18.920 < 2e-16 ***
## I_F_blockedShotAttempts 5.369e-02 2.084e-02 2.577 0.010213 *
## I_F_freeze -1.301e-01 4.843e-02 -2.686 0.007425 **
## I_F_savedShotsOnGoal -1.003e+00 6.189e-02 -16.200 < 2e-16 ***
## I_F_penalityMinutes -4.507e-02 1.779e-02 -2.534 0.011540 *
## I_F_faceOffsWon 4.814e-03 1.313e-03 3.666 0.000269 ***
## I_F_giveaways 2.375e-01 2.465e-02 9.636 < 2e-16 ***
## I_F_takeaways 1.092e-01 2.535e-02 4.307 1.93e-05 ***
## I_F_oZoneShiftStarts 5.258e-02 5.165e-03 10.181 < 2e-16 ***
## I_F_neutralZoneShiftStarts 1.493e-02 6.366e-03 2.346 0.019307 *
## penaltiesDrawn 1.657e-01 4.758e-02 3.482 0.000534 ***
## timeOnBench -3.830e-05 6.522e-06 -5.873 7.08e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.403 on 604 degrees of freedom
## Multiple R-squared: 0.9593, Adjusted R-squared: 0.9585
## F-statistic: 1186 on 12 and 604 DF, p-value: < 2.2e-16
The final reduced model has an \(R^2_f\) of 0.9585, indicating a very accurate fit for forwards.
We plot a histogram of the residuals to determine if the model’s residuals are normally distributed and a scatterplot to ensure there is no pattern.
The histogram indicates that the residuals are roughly normally distributed which is a requirement for our assumptions. The scatterplot shows increasing variation which means that we cannot make extremely accurate inferences using the model. However, the high \(R^2\) value indicates that the predictive nature of the model still holds. To verify, we use our model and data from the 2023-24 season to find the predicted point totals for each forward and plotted predcted points against actual points.
To do this, we create datasets comperable to the 2022-2023 data from the 2023-2024 data. We then plot the predicted points using our model on the x axis and the actual points scored by each player on the y-axis. We also include a line of best fit to indicate how far a player deviates from expected. Players above the line of best fit are overperformers compared to the model and players below the line are underperformers compared to the model.
The correlation between predicted points and actual points for forwards is:
## [1] 0.9750649
The high correlation between the variables as well as the high \(R^2\) value for the model is evidence that the predictive nature of the model is intact regardless of the variation in the residuals.
We then used a one-way ANOVA test to make sure that the final model with fewer variables is indeed better than the model with many unnecessary variables.
## Analysis of Variance Table
##
## Model 1: I_F_points ~ I_F_shotsOnGoal + I_F_missedShots + I_F_blockedShotAttempts +
## I_F_rebounds + I_F_freeze + I_F_playStopped + I_F_playContinuedInZone +
## I_F_savedShotsOnGoal + I_F_savedUnblockedShotAttempts + I_F_penalityMinutes +
## I_F_faceOffsWon + I_F_giveaways + I_F_takeaways + I_F_oZoneShiftStarts +
## I_F_neutralZoneShiftStarts + penaltiesDrawn + timeOnBench
## Model 2: I_F_points ~ I_F_shotsOnGoal + I_F_blockedShotAttempts + I_F_freeze +
## I_F_savedShotsOnGoal + I_F_penalityMinutes + I_F_faceOffsWon +
## I_F_giveaways + I_F_takeaways + I_F_oZoneShiftStarts + I_F_neutralZoneShiftStarts +
## penaltiesDrawn + timeOnBench
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 600 17422
## 2 604 17630 -4 -207.08 1.7829 0.1306
The large p-value indicates that the model with fewer variables is better than the first model with many variables.
As with forwards, we hand select statistics that we felt contributed to point production for defensemen. The variables are available in the code below.
d_reg_all <- lm(I_F_points ~ I_F_shotsOnGoal +
I_F_missedShots +
I_F_blockedShotAttempts +
I_F_rebounds +
I_F_freeze +
I_F_playStopped +
I_F_playContinuedInZone +
I_F_savedShotsOnGoal +
I_F_savedUnblockedShotAttempts +
I_F_penalityMinutes +
I_F_hits +
shotsBlockedByPlayer +
I_F_giveaways +
I_F_takeaways +
I_F_oZoneShiftStarts +
I_F_neutralZoneShiftStarts +
penaltiesDrawn +
timeOnBench,
data = d_men
)We use the same process to use backwards elimination to reduce the variables used in the model.
##
## Call:
## lm(formula = I_F_points ~ I_F_shotsOnGoal + I_F_blockedShotAttempts +
## I_F_rebounds + I_F_savedShotsOnGoal + I_F_giveaways + I_F_takeaways +
## I_F_oZoneShiftStarts + penaltiesDrawn + timeOnBench, data = d_men)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.9977 -1.7179 -0.0467 1.9934 24.4666
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.805e-02 5.163e-01 -0.054 0.956698
## I_F_shotsOnGoal 9.730e-01 1.402e-01 6.943 2.11e-11 ***
## I_F_blockedShotAttempts 7.350e-02 2.060e-02 3.568 0.000414 ***
## I_F_rebounds -2.981e-01 1.089e-01 -2.737 0.006542 **
## I_F_savedShotsOnGoal -9.510e-01 1.471e-01 -6.463 3.77e-10 ***
## I_F_giveaways 8.305e-02 2.674e-02 3.106 0.002062 **
## I_F_takeaways 1.238e-01 3.029e-02 4.088 5.49e-05 ***
## I_F_oZoneShiftStarts 9.472e-02 6.785e-03 13.961 < 2e-16 ***
## penaltiesDrawn 1.874e-01 7.263e-02 2.581 0.010301 *
## timeOnBench -5.877e-05 7.569e-06 -7.765 1.08e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.814 on 324 degrees of freedom
## Multiple R-squared: 0.927, Adjusted R-squared: 0.925
## F-statistic: 457.3 on 9 and 324 DF, p-value: < 2.2e-16
The resultant \(R^2_d\) is 0.925 which indicates that about 92.5% of the variation in points scored can be explained by the model.
We plot the residuals as a histogram to check for normality and on a scatterplot to look for patterns.
Again, the histogram indicates that the residuals are normal, but the scatterplot shows an increase in variation. To verify the model, just like with the forwards we use our model for defensemen and data from the 2023-24 season to plot predicted goals against actual goals for all defensemen.
For the defensemen, the correlation is:
## [1] 0.9518646
The high correlation between the variables as well as the high \(R^2\) value for the model is evidence that the predictive nature of the model is preserved regardless of the variation in the residuals.
We did a one-way ANOVA test on the final model with less variables and the first model with many variables to make sure that eliminating variables in fact produced a better model.
## Analysis of Variance Table
##
## Model 1: I_F_points ~ I_F_shotsOnGoal + I_F_blockedShotAttempts + I_F_rebounds +
## I_F_savedShotsOnGoal + I_F_giveaways + I_F_takeaways + I_F_oZoneShiftStarts +
## penaltiesDrawn + timeOnBench
## Model 2: I_F_points ~ I_F_shotsOnGoal + I_F_missedShots + I_F_blockedShotAttempts +
## I_F_rebounds + I_F_freeze + I_F_playStopped + I_F_playContinuedInZone +
## I_F_savedShotsOnGoal + I_F_savedUnblockedShotAttempts + I_F_penalityMinutes +
## I_F_hits + shotsBlockedByPlayer + I_F_giveaways + I_F_takeaways +
## I_F_oZoneShiftStarts + I_F_neutralZoneShiftStarts + penaltiesDrawn +
## timeOnBench
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 324 7507
## 2 316 7251 8 256 1.3946 0.1979
The high p-value (p=0.1979) leads us to draw the same conclusion as before.
As with forwards and defensemen, we hand select statistics that we felt contribute to point production for forwards on the power play. The variables are available in the code below.
fpp_reg_all <- lm(I_F_points ~ I_F_shotsOnGoal +
I_F_missedShots +
I_F_blockedShotAttempts +
I_F_rebounds +
I_F_freeze +
I_F_playStopped +
I_F_playContinuedInZone +
I_F_savedShotsOnGoal +
I_F_savedUnblockedShotAttempts +
I_F_penalityMinutes +
I_F_faceOffsWon +
I_F_giveaways +
I_F_takeaways +
I_F_oZoneShiftStarts +
I_F_neutralZoneShiftStarts +
penaltiesDrawn +
timeOnBench,
data = forwardspp
)We systematicly eliminate variables until we are left with variables that were all within the desired level of significance (\(\alpha = 0.05\)). The code below shows which variables made the cut.
##
## Call:
## lm(formula = I_F_points ~ I_F_shotsOnGoal + I_F_missedShots +
## I_F_blockedShotAttempts + I_F_savedShotsOnGoal + I_F_giveaways +
## I_F_oZoneShiftStarts + I_F_neutralZoneShiftStarts + penaltiesDrawn +
## timeOnBench, data = forwardspp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4298 -0.6012 -0.0318 0.4460 21.9216
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.933e-02 1.729e-01 0.285 0.775523
## I_F_shotsOnGoal 1.160e+00 6.001e-02 19.329 < 2e-16 ***
## I_F_missedShots -1.290e-01 3.579e-02 -3.604 0.000339 ***
## I_F_blockedShotAttempts 7.274e-02 2.319e-02 3.137 0.001790 **
## I_F_savedShotsOnGoal -1.068e+00 6.953e-02 -15.356 < 2e-16 ***
## I_F_giveaways 4.422e-01 4.439e-02 9.962 < 2e-16 ***
## I_F_oZoneShiftStarts 2.808e-02 6.804e-03 4.127 4.19e-05 ***
## I_F_neutralZoneShiftStarts 1.274e-01 3.781e-02 3.370 0.000799 ***
## penaltiesDrawn 3.298e-01 1.292e-01 2.552 0.010945 *
## timeOnBench -1.551e-04 3.011e-05 -5.151 3.51e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.554 on 607 degrees of freedom
## Multiple R-squared: 0.9198, Adjusted R-squared: 0.9186
## F-statistic: 773.4 on 9 and 607 DF, p-value: < 2.2e-16
The final model had an \(R^2_{pp}\) of 0.9179.
As before, we plot residuals for forwards on the powerplay on a histogram and on a scatterplot.
Once again, the histogram shows that the residuals are normally distributed, but there is not constant variability in them. Therefore, we use the same method as with forwards and defensemen to plot predicted power play points for forwards against actual power play points for forwards.
For forwards on the powerplay, the correlation is:
## [1] 0.9415112
The high correlation between the variables as well as the high \(R^2\) value for the model is evidence that the predictive nature of the model is preserved regardless of the variation in the residuals.
To check whether the final model with less variables was actually better than the first model with more variables, we run a one-way ANOVA test on both models.
## Analysis of Variance Table
##
## Model 1: I_F_points ~ I_F_shotsOnGoal + I_F_missedShots + I_F_blockedShotAttempts +
## I_F_savedShotsOnGoal + I_F_giveaways + I_F_oZoneShiftStarts +
## I_F_neutralZoneShiftStarts + penaltiesDrawn + timeOnBench
## Model 2: I_F_points ~ I_F_shotsOnGoal + I_F_missedShots + I_F_blockedShotAttempts +
## I_F_rebounds + I_F_freeze + I_F_playStopped + I_F_playContinuedInZone +
## I_F_savedShotsOnGoal + I_F_savedUnblockedShotAttempts + I_F_penalityMinutes +
## I_F_faceOffsWon + I_F_giveaways + I_F_takeaways + I_F_oZoneShiftStarts +
## I_F_neutralZoneShiftStarts + penaltiesDrawn + timeOnBench
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 607 3958.6
## 2 600 3945.9 7 12.719 0.2763 0.9631
The high p-value (p=0.9631) is strong evidence that the model with less variables is better than the model with more variables.
We create three multiple regression models that predict point totals for forwards, defensemen, and forwards on the powerplay using other individual stats. These models are not only accurate, but are robust to outlying skaters as there was no filtering statements on the 2023-2024 datasets. These models use stats from 2022-2023 to predict points for 2023-2024 and use current stats to calculate current expected point totals. Consequently, in application, these models are weakened by the fact that they are not actually predictive for the future, but rather just indicate how a player is performing compared to where they should be given their productivity in other stat categories. These models essentially use other individual stats to grade point production by a given skater. If a player has more points than the model predicts, then the player is overperforming compared to their other stats. Similarly, players who have fewer points than the model predicts are underperforming compared to their other skaters. Areas for further research would include building models for other special teams situations such as shorthanded situations. Furthermore, further research could focus on each stat individually over time to build regression models in order to predict future years’ stat totals and thus predict point totals for players going forward. It could also be interesting to explore which players over or under performed the most and see if there may be other factors which would cause these larger deviations from the model.